Structured channel pruning has been shown to significantly accelerate inference for convolutional neural networks (CNNs) on modern hardware, with a relatively minor loss of network accuracy. Recent works permanently zero the pruned channels during training, which we observe significantly hampers final accuracy, particularly as the fraction of the network being pruned increases. We propose Soft Masking for cost-constrained Channel Pruning (SMCP) to allow pruned channels to adaptively return to the network while simultaneously pruning towards a target cost constraint. By adding a soft mask re-parameterization of the weights and approaching channel pruning from the perspective of removing input channels, we allow gradient updates to previously pruned channels and give those channels the opportunity to later return to the network. We then formulate input channel pruning as a global resource allocation problem. Our method outperforms prior works on both the ImageNet classification and PASCAL VOC detection datasets.
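The soft-masking idea can be illustrated with a toy example. This is a minimal sketch, not the SMCP implementation: it assumes a single linear unit, a straight-through gradient (the mask is applied in the forward pass but ignored in the backward pass), and a magnitude-based top-k mask in place of the paper's global resource allocation step. The helper names `topk_mask` and `soft_masked_step` are hypothetical.

```python
def topk_mask(weights, keep):
    """Binary mask that keeps the `keep` largest-magnitude channels (assumed criterion)."""
    order = sorted(range(len(weights)), key=lambda c: abs(weights[c]), reverse=True)
    kept = set(order[:keep])
    return [1.0 if c in kept else 0.0 for c in range(len(weights))]


def soft_masked_step(weights, mask, x, target, lr):
    """One SGD step on loss = (y - target)^2, where y = sum_c mask[c] * weights[c] * x[c].

    The forward pass uses the masked weights. The backward pass sends the
    gradient straight through the mask, so pruned channels are still updated.
    """
    y = sum(m * w * xi for m, w, xi in zip(mask, weights, x))
    err = 2.0 * (y - target)
    return [w - lr * err * xi for w, xi in zip(weights, x)]


# Toy run: channel 1 starts out pruned, receives a gradient update anyway,
# and re-enters the network when the mask is recomputed.
weights = [0.5, 0.1, 0.4]
mask_before = topk_mask(weights, keep=2)
new_weights = soft_masked_step(weights, mask_before, x=[0.0, 1.0, 0.0], target=1.0, lr=0.5)
mask_after = topk_mask(new_weights, keep=2)
print(mask_before, new_weights, mask_after)
```

With a hard (permanent) mask the gradient for channel 1 would be multiplied by zero, so its weight would stay at 0.1 and it could never be selected again.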
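Retrieving objects that are semantically similar to an object of interest (OOI) in a query image has many practical use cases. Examples include fixing model failures such as false negatives or false positives, and mitigating class imbalance in a dataset. This targeted selection task requires finding the relevant data in a large-scale pool of unlabeled data, and manual mining at this scale is infeasible. Moreover, OOIs are often small, occupying less than 1% of the image area, are occluded, and co-exist with many semantically different objects in cluttered scenes. Existing semantic image retrieval methods often focus on mining larger-sized geographical landmarks, and/or require extra labeled data, such as images or image pairs with similar objects, to mine images with generic objects. We propose a fast and robust template matching algorithm in the DNN feature space that retrieves semantically similar images at the object level from a large unlabeled pool of data. We project the region(s) around the OOI in the query image into the DNN feature space for use as the template. This enables our method to focus on the semantics of the OOI without requiring extra labeled data. In the context of autonomous driving, we evaluate our system for targeted selection by using failure cases of object detectors as OOIs. We demonstrate its efficacy on a large unlabeled dataset with 2.2M images and show high recall in mining images with small-sized OOIs. We compare our method against a well-known semantic image retrieval method that also does not require extra labeled data. Finally, we show that our method is flexible and seamlessly retrieves images with one or more semantically different co-occurring OOIs.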
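Template matching in a feature space can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's algorithm: the feature map and template are taken to be small grids of feature vectors that have already been extracted, the similarity measure is assumed to be mean cosine similarity over the window, and the names `cosine` and `best_match` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two feature vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def best_match(feature_map, template):
    """Slide `template` over `feature_map`; return (score, (row, col)) of the best window.

    Both arguments are 2-D grids (lists of rows) of feature vectors.
    The window score is the mean cosine similarity over all template cells.
    """
    H, W = len(feature_map), len(feature_map[0])
    h, w = len(template), len(template[0])
    best_score, best_pos = float("-inf"), None
    for r in range(H - h + 1):
        for c in range(W - w + 1):
            sims = [cosine(feature_map[r + i][c + j], template[i][j])
                    for i in range(h) for j in range(w)]
            score = sum(sims) / len(sims)
            if score > best_score:  # strict comparison keeps the first best window
                best_score, best_pos = score, (r, c)
    return best_score, best_pos


# Toy 2x3 feature map with 2-D features and a 1x1 template.
feature_map = [[[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
               [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]]
template = [[[0.0, 2.0]]]
match_score, match_pos = best_match(feature_map, template)
print(match_score, match_pos)
```

Ranking every image in the unlabeled pool by its best window score would then yield retrieval candidates. Because the template covers only the region around the OOI, a small object can still dominate the score.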
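Given a small training dataset and a learning algorithm, how much more data is needed to reach a target validation or test performance? This question is critically important in applications such as autonomous driving or medical imaging, where collecting data is expensive and time-consuming. Overestimating or underestimating data requirements incurs substantial costs that could be avoided with an adequate budget. Prior work on neural scaling laws suggests that a power-law function can fit the validation performance curve and extrapolate it to larger dataset sizes. We find that this does not immediately translate to the more difficult downstream task of estimating the dataset size required to meet a target performance. In this work, we consider a broad class of computer vision tasks and systematically investigate a family of functions that generalize the power-law function to allow for better estimation of data requirements. Finally, we show that incorporating a tuned correction factor and collecting data over multiple rounds significantly improves the performance of the data estimators. Using our guidelines, practitioners can accurately estimate the data requirements of machine learning systems and save both development time and data acquisition costs.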
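The baseline that this abstract starts from, fitting a power law and inverting it for a target, can be sketched as below. This is a minimal sketch under stated assumptions: validation error is assumed to follow error = a * n^b with b < 0, the fit is an ordinary least-squares line in log-log space, and the data points are synthetic. The function names are hypothetical, and the correction factor and multi-round collection described in the abstract are not implemented here.

```python
import math


def fit_power_law(sizes, errors):
    """Least-squares fit of log(error) = log(a) + b * log(n); returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - b * mx)
    return a, b


def required_size(a, b, target_error):
    """Invert error = a * n**b to get the dataset size n that reaches target_error."""
    return (target_error / a) ** (1.0 / b)


# Synthetic measurements generated from error = 2 * n^(-0.5) (illustrative only).
sizes = [1000, 2000, 4000, 8000]
errors = [2.0 * n ** -0.5 for n in sizes]
a, b = fit_power_law(sizes, errors)
n_needed = required_size(a, b, target_error=0.01)
print(a, b, n_needed)
```

On noise-free synthetic data the inversion recovers the true requirement (about 40,000 samples here). On real learning curves, small errors in the fitted exponent are amplified by the 1/b power in the inversion, which is consistent with the abstract's observation that a good curve fit does not immediately give a good data-requirement estimate.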
Knowledge distillation facilitates the training of a compact student network by using a deep teacher network. While this has achieved great success in many tasks, it remains completely unstudied for image-based 6D object pose estimation. In this work, we introduce the first knowledge distillation method driven by the 6D pose estimation task. To this end, we observe that most modern 6D pose estimation frameworks output local predictions, such as sparse 2D keypoints or dense representations, and that the compact student network typically struggles to predict such local quantities precisely. Therefore, instead of imposing prediction-to-prediction supervision from the teacher to the student, we propose to distill the teacher's distribution of local predictions into the student network, facilitating its training. Our experiments on several benchmarks show that our distillation method yields state-of-the-art results with different compact student models and for both keypoint-based and dense prediction-based architectures.
Recent studies show that Vision Transformers (ViTs) exhibit strong robustness against various corruptions. Although this property is partly attributed to the self-attention mechanism, there is still a lack of systematic understanding. In this paper, we examine the role of self-attention in learning robust representations. Our study is motivated by the intriguing properties of the emerging visual grouping in Vision Transformers, which indicate that self-attention may promote robustness through improved mid-level representations. We further propose a family of fully attentional networks (FANs) that strengthen this capability by incorporating an attentional channel processing design. We validate the design comprehensively on various hierarchical backbones. Our model achieves a state-of-the-art 87.1% accuracy and 35.8% mCE on ImageNet-1k and ImageNet-C with 76.8M parameters. We also demonstrate state-of-the-art accuracy and robustness in two downstream tasks: semantic segmentation and object detection. Code is available at: https://github.com/NVlabs/FAN.
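We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformers (ViTs) for images of different complexity. AdaViT achieves this by automatically reducing the number of tokens processed in the network as inference proceeds. We reformulate Adaptive Computation Time (ACT) for this task, extending halting to discard redundant spatial tokens. The appealing architectural properties of vision transformers enable our adaptive token reduction mechanism to speed up inference without modifying the network architecture or inference hardware. We show that AdaViT requires no extra parameters or sub-network for halting, since we base the learning of adaptive halting on the original network parameters. We further introduce a distributional prior regularization that stabilizes training compared to prior ACT approaches. On the image classification task (ImageNet1K), we show that the proposed AdaViT is highly effective at filtering informative spatial features and cutting down overall compute. The proposed method improves the throughput of DeiT-Tiny by 62% and DeiT-Small by 38% with only a 0.3% accuracy drop, outperforming prior art by a large margin.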
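The token-halting mechanism can be sketched in a few lines. This is a simplified sketch of ACT-style halting, not the AdaViT implementation: per-layer halting scores are assumed to be given as inputs (how they are derived from the network is not modeled), and the function name `active_tokens_per_layer` and the threshold parameter `eps` are hypothetical.

```python
def active_tokens_per_layer(halting_scores, eps=0.01):
    """Return the token indices processed at each layer under cumulative halting.

    halting_scores[l][k] is the halting score in [0, 1] of token k at layer l.
    A token is dropped from all later layers once its cumulative score
    reaches 1 - eps.
    """
    num_tokens = len(halting_scores[0])
    cumulative = [0.0] * num_tokens
    active = set(range(num_tokens))
    history = []
    for layer_scores in halting_scores:
        history.append(sorted(active))  # tokens this layer actually computes on
        for k in list(active):
            cumulative[k] += layer_scores[k]
            if cumulative[k] >= 1.0 - eps:
                active.discard(k)  # halted: skipped by every later layer
    return history


# Three tokens, three layers: token 2 halts after layer 0, token 1 after layer 1.
scores = [[0.2, 0.6, 1.0],
          [0.2, 0.6, 0.0],
          [0.2, 0.6, 0.0]]
processed = active_tokens_per_layer(scores)
print(processed)
```

Because halted tokens are simply removed from the input of later layers, the layers themselves are unchanged, which matches the abstract's claim that no architectural or hardware modification is required.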
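Deep learning perception models require large amounts of labeled training data to achieve good performance. While unlabeled data is easy to acquire, the cost of labeling is prohibitive and can create a tremendous burden for companies or individuals. Recently, self-supervision has emerged as an alternative way to leverage unlabeled data. In this paper, we propose a new lightweight self-supervised learning framework that can boost supervised learning performance at minimal additional computation cost. We introduce a simple yet flexible multi-task co-training framework that integrates a self-supervised task into any supervised task. Our approach uses pretext tasks that incur minimal compute and parameter overhead and minimal disruption to existing training pipelines. We demonstrate the effectiveness of our framework by applying two self-supervised tasks to different perception models for object detection and panoptic segmentation. Our results show that both self-supervised tasks improve the accuracy of the supervised task while also demonstrating strong domain adaptation capability when used with additional unlabeled data.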
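Panoptic segmentation involves a combination of joint semantic segmentation and instance segmentation, where image contents are divided into two types: things and stuff. We present Panoptic SegFormer, a general framework for panoptic segmentation with transformers. It contains three innovative components: an efficient deeply supervised mask decoder, a query decoupling strategy, and an improved post-processing method. We also use Deformable DETR, a fast and efficient version of DETR, to process multi-scale features efficiently. Specifically, we supervise the attention modules in the mask decoder in a layer-wise manner. This deep supervision strategy lets the attention modules quickly focus on meaningful semantic regions. Compared to Deformable DETR, it improves performance and halves the number of required training epochs. Our query decoupling strategy decouples the responsibilities of the query set and avoids mutual interference between things and stuff. In addition, our post-processing strategy improves performance at no extra cost by jointly considering classification and segmentation quality to resolve conflicting mask overlaps. Our approach improves on the baseline DETR model by 6.2% PQ. Panoptic SegFormer achieves state-of-the-art results with 56.2% PQ. It also shows stronger zero-shot robustness than existing methods. The code is released at https://github.com/zhiqi-li/panoptic-segformer.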
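Deep neural networks have reached high accuracy on object detection, but their success hinges on large amounts of labeled data. To reduce this dependency on labels, various active learning strategies have been proposed, typically based on the confidence of the detector. However, these methods are biased towards high-performing classes and can lead to acquired datasets that are not good representatives of the test set data. In this work, we propose a unified active learning framework that considers both the uncertainty and the robustness of the detector, ensuring that the network performs well across all classes. Furthermore, our method leverages auto-labeling to suppress a potential distribution drift while boosting the performance of the model. Experiments on PASCAL VOC07+12 and MS-COCO show that our method consistently outperforms a wide range of active learning methods, yielding up to a 7.7% improvement in mAP, or up to an 82% reduction in labeling cost. Code will be released upon acceptance of the paper.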
We present SegFormer, a simple, efficient yet powerful semantic segmentation framework which unifies Transformers with lightweight multilayer perceptron (MLP) decoders. SegFormer has two appealing features: 1) SegFormer comprises a novel hierarchically structured Transformer encoder which outputs multiscale features. It does not need positional encoding, thereby avoiding the interpolation of positional codes, which leads to decreased performance when the testing resolution differs from the training resolution. 2) SegFormer avoids complex decoders. The proposed MLP decoder aggregates information from different layers, thus combining both local attention and global attention to render powerful representations. We show that this simple and lightweight design is the key to efficient segmentation on Transformers. We scale our approach up to obtain a series of models from SegFormer-B0 to SegFormer-B5, reaching significantly better performance and efficiency than previous counterparts. For example, SegFormer-B4 achieves 50.3% mIoU on ADE20K with 64M parameters, being 5× smaller and 2.2% better than the previous best method. Our best model, SegFormer-B5, achieves 84.0% mIoU on the Cityscapes validation set and shows excellent zero-shot robustness on Cityscapes-C. Code will be released at: github.com/NVlabs/SegFormer.